Structural classification shows that the number of different protein folds is surprisingly small.
It also appears that proteins are built in a modular fashion, from a relatively small number of
components. Here we propose to identify the modular building blocks of proteins with the dark
soliton solution of a generalized discrete nonlinear Schrödinger equation. To this end we show that
practically all protein loops can be obtained simply by scaling the size and by joining together a
number of copies of the soliton, one after another. The soliton has only two loop-specific parameters,
and we identify their possible values in the Protein Data Bank. We show that with a collection of 200
sets of parameters, each determining a soliton profile that describes a different short loop, we cover
over 90% of all proteins with experimental accuracy. We also present two examples that describe
how the loop library can be employed both to model and to analyze the structure of folded proteins.



I. INTRODUCTION 

Proteins come in many shapes, but the number of different folds is definitely much smaller than suggested by Levinthal's estimate [1]. For example, thus far the structural classification scheme SCOP [2] has identified 1393 unique folds, while in CATH [3] there are currently 1282 topologies. These figures have not changed since 2008, indicating that the number of different protein conformations is quite limited and that most of them have probably already been observed. Furthermore, the great success of SCOP, CATH and other approaches such as FSSP [3] in classifying the architecture of proteins is a manifestation that proteins are built in a modular fashion from a relatively small number of different components.

Here we advocate a quantitative, energy function based approach to identify and classify the modular components of proteins. We propose to utilize the dark soliton solution of a generalized discrete nonlinear Schrödinger (DNLS) equation as the basic modular building block. The original DNLS equation [5], [7] shares a long history with protein research. The equation was introduced by Davydov to explain how an energy excitation propagates along the α-helix [5], [8]. The soliton evokes a deformation of the protein shape, and as a consequence a trapped soliton is a natural cause for protein folding. The present generalization of the original DNLS equation is motivated by recent observations that protein loops in the HP35 villin headpiece with Protein Data Bank (PDB) [10] code 1YRF [11], and in the myoglobin with PDB code 1ABS [12], are accurately described in terms of its dark soliton. In this article we extend this observation to essentially all proteins in the PDB. We propose to classify the shapes of loops in terms of a small number of universal parameters that appear in the generalized DNLS equation. These parameters specify global characteristics such as the size and location of a short loop that is described by a single soliton, but the detailed shape of this loop is entirely determined by the soliton solution. Each set of soliton parameters then corresponds to a different short fundamental loop, and these fundamental loops constitute the modular building blocks of proteins.

We adopt the present experimental precision as the quantitative criterion for identifying two different protein structures. The accuracy of x-ray measurements, which is the dominant approach to structure determination, is measured by the B-factor. For very high resolution structures the backbone Cα carbons have B-factor values that are typically less than [13]

$$B_{acc} \lesssim 35\ \text{Å}^2 \tag{1}$$

According to the Debye-Waller relation this corresponds to a fluctuation distance that is less than or equal to

$$\sqrt{\langle x^2 \rangle} \approx 0.65\ \text{Å} \tag{2}$$

Consequently we identify two structures if they deviate from each other by no more than 0.6-0.7 Å in RMSD. Indeed, when the RMSD value between two loop configurations is less than this cut-off value, present experimental techniques cannot reliably differentiate between them, so that for all practical purposes the two structures are identical. Here we show that it is sufficient to introduce only 200 distinct parameter sets for the soliton, constructed using 44 different proteins, in order to describe over 90% of known protein structures with the B-factor accuracy. Consequently the number of different modular protein components appears to be almost an order of magnitude smaller than suggested by the present SCOP and CATH data. Since the purpose here is to show that we have a method that works, we do not aim to optimize the loop library. But we suspect that the actual number of truly independent loops is much smaller, probably less than 100. In support of this we show that the 200 fundamental loops can be described by 57 multiply covered loops.
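The step from (1) to (2) can be checked numerically. The following is a minimal sketch that assumes the Debye-Waller relation in the common isotropic form $B = 8\pi^2 \langle x^2 \rangle$; the function name `fluctuation_distance` is illustrative and does not come from the original work.

```python
import math


def fluctuation_distance(b_factor):
    """Return sqrt(<x^2>) in Angstrom for a B-factor given in Angstrom^2.

    Assumption: the Debye-Waller relation is used in the form
    B = 8 * pi^2 * <x^2>.
    """
    if b_factor < 0:
        raise ValueError("B-factor must be non-negative")
    return math.sqrt(b_factor / (8.0 * math.pi ** 2))


# B = 35 A^2 is the upper bound quoted in (1).
print(round(fluctuation_distance(35.0), 2))
```

Under this assumption the bound (1) translates into a distance inside the 0.6-0.7 Å window that is used as the RMSD cut-off.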

II. MODEL

We characterize the shape of a protein in terms of its central Cα backbone. These carbon atoms are located at the positions $\mathbf{r}_i$, where $i = 1, \dots, N$ label the residues. For each pair of nearest neighbors $\mathbf{r}_{i+1}$ and $\mathbf{r}_i$ we introduce the unit tangent vector and the unit binormal vector, respectively

$$\mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{|\mathbf{r}_{i+1} - \mathbf{r}_i|} \quad \& \quad \mathbf{b}_i = \frac{\mathbf{t}_{i-1} \times \mathbf{t}_i}{|\mathbf{t}_{i-1} \times \mathbf{t}_i|} \tag{3}$$

Then

$$\psi_i = \arccos(\mathbf{t}_{i+1} \cdot \mathbf{t}_i) \quad \& \quad \theta_i = \arccos(\mathbf{b}_{i+1} \cdot \mathbf{b}_i) \tag{4}$$

are the standard discrete Frenet frame bond angle and torsion angle of the backbone. Note that the bond angle $\psi_i$ is determined by three Cα carbons, those at the sites $\mathbf{r}_i$, $\mathbf{r}_{i+1}$ and $\mathbf{r}_{i+2}$. But for the torsion angle $\theta_i$ we need four Cα carbons, those between sites $i-1$ and $i+2$. Conversely, if the bond and torsion angles are known we can reconstruct the entire protein backbone by solving the discrete Frenet equation. We refer to [14] for details of the present coordinate system.
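The definitions (3) and (4) translate directly into array operations. The sketch below is an illustration only: the function name `frenet_angles` and the zero-based array layout are assumptions, and the arccos in (4) returns unsigned angles in $[0, \pi]$, so any sign convention for the torsion angle would have to be taken from [14].

```python
import numpy as np


def frenet_angles(r):
    """Bond angles psi and torsion angles theta from C-alpha positions.

    r has shape (N, 3). With zero-based indices, psi[i] is built from
    sites i, i+1, i+2 and theta[k] from sites k, k+1, k+2, k+3, so that
    theta[k] corresponds to theta_{k+1} in the notation of (4).
    """
    r = np.asarray(r, dtype=float)
    bonds = np.diff(r, axis=0)
    t = bonds / np.linalg.norm(bonds, axis=1, keepdims=True)   # tangents, eq. (3)
    cross = np.cross(t[:-1], t[1:])                             # t_{i-1} x t_i
    b = cross / np.linalg.norm(cross, axis=1, keepdims=True)   # binormals, eq. (3)
    psi = np.arccos(np.clip(np.sum(t[1:] * t[:-1], axis=1), -1.0, 1.0))
    theta = np.arccos(np.clip(np.sum(b[1:] * b[:-1], axis=1), -1.0, 1.0))
    return psi, theta


# A regular helix used as a toy backbone: both angles are constant along it.
helix = np.array([[np.cos(i), np.sin(i), 0.5 * i] for i in range(10)])
psi, theta = frenet_angles(helix)
print(psi.shape, theta.shape)
```

A chain of N sites yields N-2 bond angles and N-3 torsion angles, in line with the three-site and four-site dependence noted above.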

An excellent approximation to the standard right-handed α-helix and the β-strand is obtained by setting

$$(\psi_i, \theta_i)_\alpha \approx \left(\tfrac{\pi}{2}, 1\right) \quad \& \quad (\psi_i, \theta_i)_\beta \approx (1, \pi) \tag{5}$$

Similarly, we get the other familiar regular secondary structures like 3/10 helices, left-handed helices etc. by selecting proper constant values for the bond and torsion angles. We also record that the following $\mathbb{Z}_2$ transformation leaves the backbone coordinates intact [14]

$$\psi_k \to -\psi_k \ \ \text{for all } k \geq i, \qquad \theta_i \to \theta_i - \pi \tag{6}$$



Loops are configurations that bridge between these regular secondary structures. Elsewhere [11], [12] it has been shown that loops in the chicken villin headpiece with PDB code 1YRF and the myoglobin 1ABS can be described in terms of the dark soliton of the generalized discrete nonlinear Schrödinger equation that derives from the energy function

$$E = -\sum_{i=1}^{N-1} \ldots \; + \; \sum_{i=1}^{N} \Big\{ \ldots \Big\} \tag{7}$$

where $(q, \ldots, v, w)$ are parameters. Here the first sum together with the first three terms in the second sum comprise exactly the energy of the standard DNLS equation [11]. The fourth ($v$) is a conserved quantity in the DNLS hierarchy [7], called the "helicity". We note that the conserved "momentum" could also be added [7], but since the improvement in accuracy is minor we leave it out. The last ($w$) is the Proca mass term that we include for completeness. In this manner the functional form (7) becomes deeply anchored in the elegant mathematical structure of integrable hierarchies [7]. But unlike e.g. force fields in molecular dynamics, the energy function (7) does not purport to explain the fine details of the atomic-level mechanisms that give rise to protein folding. Instead, in line with Landau-Lifschitz theories, it describes the properties of a folded protein backbone in terms of universal physical arguments.

In [11] it has been shown that (7) supports solitons. For this we first eliminate the variable $\theta_i$ in terms of

$$\theta_i[\psi_i] = \ldots \tag{8}$$

If the value of $\theta_i$ falls outside of its fundamental domain $[-\pi, \pi]$ we redefine it modulo $2\pi$.

We vary the energy function with respect to $\psi_i$ and substitute $\theta_i[\psi_i]$ from (8) to arrive at

$$\psi_{i+1} - 2\psi_i + \psi_{i-1} = U'[\psi_i]\, \psi_i \qquad (i = 1, \dots, N) \tag{9}$$

with $\psi_0 = \psi_{N+1} = 0$. This is a generalization of the DNLS equation with

$$U[\psi] = -(\ldots)\, \psi^2 + (\ldots)\, \psi^4 + \ldots \tag{10}$$

where we recognize the familiar structure of the nonlinear Schrödinger equation potential [5]-[5]. Indeed, it turns out that in the case of proteins the correction terms give rise to an adjustment that is tiny in comparison to the B-factor accuracy.

The exact dark soliton solution to the discrete nonlinear Schrödinger equation is not known in a closed form. But it should be a discrete version of the continuum solution, and thus an excellent approximation is obtained by naive discretization of the continuum dark NLSE soliton

$$\psi_i = \frac{(m_1 + 2\pi N_1)\, e^{c_1 (i - s)} - (m_2 + 2\pi N_2)\, e^{-c_2 (i - s)}}{e^{c_1 (i - s)} + e^{-c_2 (i - s)}} \tag{11}$$

Here $s$ is a parameter that determines the backbone site location of the center of the fundamental loop that is described by the soliton. The $m_{1,2} \in [0, \pi]$ are parameters that in the continuum limit coincide with known combinations of the parameters in (10) [B]-[S]; in the case of proteins their values are entirely determined by the adjacent helices and strands. The $N_1$ and $N_2$ constitute the integer parts of $m_{1,2}$; initially we take $N_1 = N_2 = N$. This integer is like a covering number: it determines how many times $\psi_i$ covers its fundamental domain $[-\pi, \pi]$ when we traverse the loop once. Negative values of $\psi_i$ are related to the positive values by (6). Notice that for $m_1 = m_2$ and $c_1 = c_2$ we recover the hyperbolic tangent. Moreover, only $c_1$ and $c_2$ are intrinsically loop-specific parameters; they specify the length of the loop and, as in the case of the $m_{1,2}$, in the continuum limit they are known combinations of the parameters in (10). Whenever $\psi_i$ takes values outside of the fundamental domain $[-\pi, \pi]$, we redefine it modulo $2\pi$.
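The profile (11) and the reduction modulo $2\pi$ are simple to evaluate. The following sketch assumes NumPy and uses illustrative names (`soliton_profile`, `wrap_angle`) and toy parameter values that are not taken from the loop library.

```python
import numpy as np


def soliton_profile(sites, s, m1, m2, c1, c2, n1=0, n2=0):
    """Evaluate the discretized dark soliton (11) at the given sites."""
    x = np.asarray(sites, dtype=float) - s
    upper = m1 + 2.0 * np.pi * n1        # value approached for i >> s
    lower = m2 + 2.0 * np.pi * n2        # minus the value approached for i << s
    grow, decay = np.exp(c1 * x), np.exp(-c2 * x)
    return (upper * grow - lower * decay) / (grow + decay)


def wrap_angle(angle):
    """Map an angle to [-pi, pi) by reducing it modulo 2*pi."""
    return (np.asarray(angle, dtype=float) + np.pi) % (2.0 * np.pi) - np.pi


# Toy example: a six-site loop centered between sites 3 and 4.
sites = np.arange(1, 7)
print(wrap_angle(soliton_profile(sites, s=3.5, m1=1.5, m2=1.5, c1=2.0, c2=2.0)))
```

For $m_1 = m_2 = m$, $c_1 = c_2 = c$ and vanishing covering numbers the function reduces to $m \tanh(c\,(i - s))$, which is the hyperbolic tangent limit mentioned above.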

A full protein chain is the sum of terms of the form (11), over all the locations of the centers of its fundamental loops.

As a parameter basis for the soliton description of loops, we use the parameters in (11), (8): After determining the values of the parameters in (11), we compute the torsion $\theta_i$ from (8) and construct the curve using the discrete Frenet equation. Notice that since there are only two independent parameters $b$ and $e$ for each fundamental loop in (8), they are both specified by the regular secondary structures that are adjacent to this loop. All intrinsic loop dependence is due to $\psi_i$.

Since our aim is to describe protein structures entirely in terms of DNLS solitons, hereafter we always define helices and strands and other similar regular secondary structures strictly in terms of their geometry, by using $(\psi, \theta)$ values such as (5). The fundamental loops, the helices and the strands are then all on similar conceptual footing in the sense that each of these structures is specified by two parameters. In particular, a fundamental loop coalesces into a helix or a strand at an exponential rate when the distance $|i - s|$ from its center increases.



III. PARAMETER DETERMINATION 

The challenge we now need to address is to enumerate the possible values of the parameters (11) and (8) in the case of PDB proteins. We determine these parameters using the protein structures in [15]. We use the list of proteins in http://bioinfo.tg.fh-giessen.de/pdbselect dated February 11, 2011. The structures in this list have a resolution better than 3.0 Å, R-factor less than 0.3, and less than 25% homology equivalence. But since our ambition is to match the B-factor accuracy of 0.6-0.7 Å, we have further pruned this list by selecting only those x-ray structures that have resolution better than 2.0 Å. This leaves us with a total of 3,027 proteins. With very few exceptions the R-factors in our pruned set are less than 0.25, and the mean value is R = 0.17, see Figure 1. In Table I we display the distribution of the residues in our data set according to the different secondary structures.
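The pruning step amounts to a filter on per-structure metadata. The sketch below is only an illustration: the record type `Entry`, its field names and the toy identifiers are hypothetical and do not reflect the actual format of the PDBselect list.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical metadata record for one structure in the list."""
    pdb_id: str
    method: str          # e.g. "x-ray" or "nmr"
    resolution: float    # in Angstrom; smaller is better
    r_factor: float


def prune(entries, max_resolution=2.0):
    """Keep only x-ray structures with resolution better than the limit."""
    return [e for e in entries
            if e.method == "x-ray" and e.resolution < max_resolution]


# Toy records with placeholder identifiers, for illustration only.
toy = [
    Entry("AAAA", "x-ray", 1.7, 0.20),
    Entry("BBBB", "x-ray", 2.6, 0.22),
    Entry("CCCC", "nmr", 0.0, 0.0),
]
print([e.pdb_id for e in prune(toy)])
```

The default limit mirrors the 2.0 Å resolution criterion stated above; the remaining PDBselect criteria are assumed to have been applied upstream.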

Our construction of the parameters in (11), (8) proceeds in three steps: We first use visual inspection and RMSD minimization to identify a set of 200 different putative fundamental loop structures that describe the loops in our list of proteins with different pre-defined accuracies. We then determine the parameters (11), (8) so that the ensuing profiles approximate our 200 visually identified loop structures with an RMSD precision of 0.5 Å or better. Finally, we consider various multiple coverings of the fundamental domain $[-\pi, \pi]$ of the bond angle, to determine a set of integers $N_1$, $N_2$ in (11). The aim is to shrink the set of 200 loop structures into a smaller subset that covers the original set with an accuracy that is better than 0.5 Å in RMSD distance.

FIG. 1: The distribution of the RMSD distance between the original 200 loops and their soliton approximations. [Plot statistics: entries 200, mean 0.1421, RMS 0.09181.]

TABLE I: The total number of residues in our data set and their breakdown into different structures according to PDB.

total      helices    strands    loops
550,997    216,732    140,625    193,640

We start our construction by selecting a random protein from our list, for example the myoglobin with PDB code 1A6M. In Table II we present the loop structures that we have visually identified in 1A6M. For this we have analyzed its $\psi_i$ profile using the symmetry transformation (6) in the manner we have explained in [16] (see also Figure 6b). In addition, we list the number of times each of the loops appears in our entire data set. For this we identify two loops provided they have the same length and their mutual RMSD distance is less than 0.5 Å.

In the sites for the loop structures that we list in Table II, the first and last sites always coincide with values that describe known regular secondary structures such as (5). Consequently, for example, the loop 18-23 has four sites in the loop proper, and the first and last sites 18 and 23 are in α-helical positions as far as the parameter values are concerned.
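The identification rule (equal length and mutual RMSD below 0.5 Å) can be sketched as follows. The text does not spell out how the two loops are superimposed; the sketch assumes an optimal rigid-body superposition computed with the Kabsch algorithm, and the names `kabsch_rmsd` and `same_loop` are illustrative.

```python
import numpy as np


def kabsch_rmsd(p, q):
    """RMSD between two equally long point sets after optimal superposition."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p - p.mean(axis=0)                    # remove translations
    q = q - q.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)         # SVD of the covariance matrix
    d = np.sign(np.linalg.det(vt.T @ u.T))    # exclude improper rotations
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    diff = p @ rot.T - q
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))


def same_loop(p, q, cutoff=0.5):
    """Two loops are identified if they have equal length and RMSD < cutoff."""
    return len(p) == len(q) and kabsch_rmsd(p, q) < cutoff


# Toy check: a rotated and shifted copy of a six-site chain is the same loop.
loop = np.array([[0, 0, 0], [3.8, 0, 0], [5, 3.6, 0],
                 [8, 4, 2], [9, 7, 3], [12, 8, 5]], dtype=float)
a = 0.7
rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
print(same_loop(loop, loop @ rz.T + np.array([1.0, -2.0, 3.0])))
```

The cut-off is a keyword argument so that the same function can be reused for the range of cut-off values considered later in this section.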

It is notable that two pairs of putative loops, the loops (77,83) and (81,87) and in particular the loops (95,100) and (96,100), are overlapping. In the latter case this is because we can introduce two different interpretations: We can interpret (95,100) as a loop that connects an α-helix with another α-helix, while by removing the site 95 we have a configuration that we can interpret as a loop that starts from a β-strand. A refinement of the cut-off RMSD distance 0.5 Å to a smaller value might help us to eliminate one of these two loops. However, this would be highly questionable as it would also push us below the experimental B-factor accuracy, and that does not make much sense. We adopt the position that 0.5 Å is about the best one can do in identifying the fundamental loops with presently available experimental data.

TABLE II: The sites of the loop structures (11), (8) that we identify in 1A6M. Indexing starts from the N terminus. We also display the number of matches we have in our data set when we use as a cut-off value 0.5 Å in RMSD distance.

Sites      Matches
18-23      525
34-39      702
41-46      610
48-54      183
56-61      819
77-83      2
81-87      1501
95-100     298
96-100     2352
122-127    287

We continue by selecting a new protein structure. We perform the same visual identification of loops. We continue the process until we have identified a total of exactly 200 loops such that each pair of these loops with the same number of sites has a mutual RMSD distance that exceeds 0.5 Å. For this we only need to go through 44 randomly chosen protein structures in our data set; the proteins are listed in Table III.
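The stopping rule described above can be phrased as a greedy selection. The following sketch is an assumption about how such a procedure could be organized, not the authors' implementation: `build_library` and `rms_difference` are hypothetical names, and the toy distance omits the structural superposition that a real RMSD comparison would need.

```python
import math


def build_library(candidates, distance, cutoff=0.5, max_size=200):
    """Add a candidate loop only if it differs from every stored loop.

    Two loops count as different when their lengths differ or when their
    mutual distance exceeds the cutoff.
    """
    library = []
    for loop in candidates:
        if len(library) >= max_size:
            break
        is_new = all(len(loop) != len(ref) or distance(loop, ref) > cutoff
                     for ref in library)
        if is_new:
            library.append(loop)
    return library


def rms_difference(a, b):
    """Toy distance between equally long number sequences (no superposition)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# One-dimensional toy "loops" for illustration only.
toy_loops = [[0, 0, 0], [0.1, 0, 0], [2, 2, 2], [0, 0], [2, 2, 2.1]]
print(len(build_library(toy_loops, rms_difference)))
```

With `max_size=200` the loop terminates at the library size used in the text; the order in which candidates are visited determines which representative of each cluster is kept.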



1A6M(A)    20VG(A)    207A(A)    IXG(D)
1LWB(A)    1SAU(A)    2I4A(A)    3GOE
2AIB       1P60       2VZC(A)    1WMA(A)
3F1L(B)    1MUN(A)    3PD7(B)    1WKQ(B)
3E7R(L)    30Q2(A)    3BFQ(G)    1SEN(A)
1MN8(C)    3CT6(A)    2XL6(A)    3A5F(B)
3CI3(A)    3G46(A)    1ZZK(A)    1PSR(A)
1127(A)    1PIX(B)    2V9V(A)    2W72(A)
1OAI(A)    3DNJ(A)    1NNF(A)    3LB2(A)
1Q60(B)    3P3C(A)    1QNR(A)    3L0F(A)
1D04(A)    30GN(B)    3MBX(B)    2W91(A)

TABLE III: The PDB codes of the 44 proteins that we have used in constructing our loop library (with chain in parentheses).

These 200 loop structures have between 5 and 9 sites, 
including the two end points that are in regular secondary 
structure positions. The distribution of the number of 
loops according to their size is shown in Table IV. 

Loops with length 6 are by far the most common, and we only identify seven length 8 loops, only one fundamental loop with length 9, and none longer. We suspect that the very few length 8 loops and the single length 9 loop can probably be interpreted as combinations of length 5 and 6 loops by an extended search of fundamental loops. The purpose of the present article is not to develop a publicly available databank but to form a conceptual basis for developing such a databank by showing that we have a method that works. Consequently we have stopped our search for new loop structures when we reached exactly 200 structures.

TABLE IV: The distribution of the 200 loop structures according to their length, with the first and last sites in regular secondary structure positions. Two loops with the same length but separated from each other by more than 0.5 Å in RMSD distance are considered different.

Length    5     6      7     8    9
Number    32    116    44    7    1

In Table V we display how many residues in our entire data set are covered by our 200 loops, when we search for structures using as a criterion the RMSD distance between the structure and a loop. We have performed the search with RMSD cut-off values that range from 0.2 Å to 0.7 Å. The largest value 0.7 Å is selected to slightly exceed the estimate (2). For a cut-off value of 0.6 Å, i.e. just below (2), the number of sites in configurations that are covered by our 200 loops already clearly exceeds the total number of sites that are classified as loop sites according to PDB; see Table I. This suggests that we cover all of the loop structures. However, a closer inspection shows that due to overlapping structures the actual coverage is somewhere around 90%. But when the cut-off value reaches 0.7 Å we rarely find any loop structures that remain uncovered. Since we have a very representative data set, this suggests that within the experimental B-factor fluctuation distance accuracy (2), a large majority of all loops in PDB, both short and long, are various kinds of modular combinations of the 200 fundamental loops we have identified.

TABLE V: The coverage of our putative loops in terms of residues, at different RMSD cut-off values. Note that a structure that has between 5 and 9 sites has a length that is roughly between 20 and 40 Å.

RMSD cut-off (Å)    Loop sites matched
< 0.2               7,208
< 0.3               31,655
< 0.4               78,561
< 0.5               148,267
< 0.6               245,954
< 0.7               428,387

We now proceed to the second step of our construction. Here we search for parameters in the soliton profile (11), (8) that describe our fundamental 200 loops, so that the RMSD distance between a loop and its soliton is less than 0.5 Å. Since the RMSD distance between any two loop structures in our set of 200 loops is always larger than 0.5 Å, we demand that the pairwise RMSD distance between any two explicit solitons also exceeds 0.5 Å. We estimate the parameters using a Monte Carlo search that minimizes the RMSD distance between a loop and its soliton. The parameter values are summarized in Figures 2-4.

FIG. 2: The distribution of the parameter values $c_1$ and $c_2$ in (11) in the 200 solitons we have constructed. As expected, these two distributions are practically identical. [Plot statistics for $c_2$: entries 200, mean 2.602, RMS 0.3685.]
FIG. 4: The distribution of the parameter values for the torsion angle (8) in our 200 solitons. Observe that the parameter $e$ clusters in two regions, around $-1$ and below $-10^{[\ldots]}$. Furthermore, the parameter $b$ has very large values, in excess of $\pm 10^{[\ldots]}$, and the spread is very large. The fundamental region of the torsion angle $\theta_i$ is $[-\pi, \pi]$ and the large values reveal that as a soliton, the loops cover the sphere $(\psi, \theta) \sim \mathbb{S}^2$ several times, i.e. each of the loops is a multiple soliton configuration. This explains why a very regular soliton such as (11), (8) can model the apparently highly irregular $\psi_i$ and $\theta_i$ profiles such as those we commonly find in PDB.



In Table VI we show how the number of sites that our solitons cover in our full data set depends on the cut-off RMSD distance, for values between 0.2 and 0.7 Å. The results are very similar to those in Table V; there is no practical difference. We also find that when the RMSD cut-off value exceeds (2), the loop structures in our data set that are not fully covered by our 200 explicit solitons become very rare. Consequently we have succeeded in constructing a basis of 200 explicit soliton structures that cover most of the PDB loops, apparently over 90% of them, and with an accuracy that is comparable to the experimental B-factor accuracy.



FIG. 3: The distribution of the parameter values $m_1$ and $m_2$ in (11) for the 200 solitons. As in Figure 2, the distributions are highly symmetric; the difference is not statistically meaningful. The α-helices and β-strands (5) are also clearly identifiable in the parameter values.

For each of the 200 loops, we are able to identify parameters so that there is always a soliton profile (11), (8) with explicit parameter values that describes the loop with an RMSD accuracy better than 0.5 Å. In fact, as shown in Figure 1, the mean RMSD distance between the original loop configuration and its explicit soliton is a mere 0.14 Å, slightly less than the 0.15 Å estimate for zero point fluctuations in [17]. At this separation distance, it then becomes conceptually meaningless to consider the two structures as different.
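The Monte Carlo parameter search mentioned in this section can be illustrated with a deliberately simplified variant. The sketch below is not the authors' procedure: it fits the bond angle profile only (rather than the three-dimensional RMSD), accepts a random trial step only when it lowers the cost (a zero-temperature random search), and uses hypothetical names and synthetic target data.

```python
import numpy as np


def soliton(sites, s, m1, m2, c1, c2):
    """Soliton profile of the form (11) with N1 = N2 = 0."""
    x = np.asarray(sites, dtype=float) - s
    grow, decay = np.exp(c1 * x), np.exp(-c2 * x)
    return (m1 * grow - m2 * decay) / (grow + decay)


def profile_rmsd(params, sites, target):
    """Root mean square difference between a soliton and a target profile."""
    return float(np.sqrt(np.mean((soliton(sites, *params) - target) ** 2)))


def fit_soliton(sites, target, start, steps=5000, step_size=0.05, seed=0):
    """Random search: keep a Gaussian trial step only if it lowers the cost."""
    rng = np.random.default_rng(seed)
    best = np.asarray(start, dtype=float)
    best_cost = profile_rmsd(best, sites, target)
    for _ in range(steps):
        trial = best + rng.normal(scale=step_size, size=best.size)
        cost = profile_rmsd(trial, sites, target)
        if cost < best_cost:
            best, best_cost = trial, cost
    return best, best_cost


# Synthetic target generated from known parameters (s, m1, m2, c1, c2).
sites = np.arange(1, 7)
target = soliton(sites, 3.5, 1.5, 1.0, 2.5, 2.5)
params, cost = fit_soliton(sites, target, start=(3.0, 1.2, 1.2, 2.0, 2.0))
print(cost)
```

A finite-temperature acceptance rule or a cost based on reconstructed backbone coordinates could replace the two simplifications without changing the overall structure of the loop.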



TABLE VI: The coverage of our explicit soliton configurations in terms of residues, at different RMSD cut-off values.

Cut-off (Å)    Loop sites matched
< 0.2          5,954
< 0.3          28,399
< 0.4          74,037
< 0.5          144,683
< 0.6          245,257
< 0.7          433,737



Finally, we have constructed our 200 explicit solitons by a direct approach, with no attempt at optimization. As a consequence we suspect that the number of explicit solitons can be substantially decreased without compromising the coverage. To show that this is the case we have employed the freedom to choose the integers $N_1$ and $N_2$ in (11) independently. These integers are covering numbers: they determine how many times we cover the fundamental domain $\psi \in [-\pi, \pi]$. They have no effect on how the parameters $m_1$ and $m_2$ determine the asymptotic $\psi_i$ values. Consequently two solitons that differ from each other only by these integers interpolate between regular secondary structures with identical $\psi$ values, and in this sense they can be viewed as different multiple coverings of a single basic soliton with $N_1 = N_2 = 0$. But note that the $\theta_i$ values can still be different.

We proceed as follows: We first select a pair of solitons in our library. All the parameters in the first soliton are kept fixed. In the second soliton we also keep all parameters fixed, except that we allow the integers $N_1$ and $N_2$ to vary. We then ask whether it is possible to find a new set of integers $(N_1, N_2)$ in the second soliton, so that the RMSD distance between the two solitons becomes less than 0.5 Å. We have found that it is possible to substantially lower the RMSD distance between two solitons. For example, one can find pairs where the initial distance is above 3.5 Å and this is lowered to a mere 0.28 Å when we judiciously select the integers $(N_1, N_2)$. In this way we have been able to show that in our set of 200 solitons there are only 57 covering solitons that we fail to bring to within a distance of 0.5 Å from each other. But we suspect that in a carefully constructed and optimized library the number of covering solitons is even much smaller.
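The search over covering numbers is a small discrete optimization. A minimal sketch, assuming a brute-force scan over a symmetric integer range (the range, the function name `best_covering` and the toy distance are all hypothetical; the text does not state which integer values were scanned):

```python
from itertools import product


def best_covering(distance, n_max=3):
    """Scan integer pairs (N1, N2) and return the pair with the smallest distance.

    `distance` is any callable that maps (N1, N2) to the RMSD between the
    fixed first soliton and the second soliton built with these integers.
    """
    pairs = product(range(-n_max, n_max + 1), repeat=2)
    best = min(pairs, key=lambda n: distance(*n))
    return best, distance(*best)


def toy_distance(n1, n2):
    """Stand-in for the soliton-to-soliton RMSD, for illustration only."""
    return 0.28 + abs(n1 - 1) + abs(n2 + 2)


print(best_covering(toy_distance))
```

Two solitons would then be merged into one covering class whenever the returned distance falls below the 0.5 Å cut-off.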



IV. EXAMPLES 

As an example of how the 200 explicit solitons cover our data set at different cut-off values, we show in Figure 5 the typical $\psi_i$ profile of a protein in PDB; we have randomly chosen the one with PDB code 1KZQ. This protein has 289 residues, the experimental resolution is 1.7 Å and the observed R-value is 0.2. In the top panel of Figure 5 we use the cut-off value 0.3 Å to locate our solitons. This cut-off value is clearly below the B-factor accuracy of the Cα atoms in 1KZQ, and our 200 solitons cover only around 20 per cent of the loop structures. This coverage is consistent with the results in Table VI. When we increase the cut-off value to 0.5 Å (middle panel of Figure 5) most of the loops become covered by solitons, and at 0.7 Å there is only one loop with three sites within the loop (i.e. a soliton with 5 sites) that does not appear among our 200 solitons. This loop can be modeled by a single soliton, and the soliton can be added to our initial unpruned library if so desired, increasing the number of solitons to 201. Alternatively, we could try to describe it as a multiple covering of one of our 57 solitons. Notice that in addition there are four isolated sites where the deviation exceeds the cut-off value of 0.7 Å. Indeed, it is not too exceptional for proteins that are resolved with this resolution to have individual Cα sites where the experimental accuracy as measured by the B-factor fluctuation distance exceeds 0.7 Å. These low-resolution Cα carbons commonly become visible in our matching procedure, and this could be used to identify potential problems in data.



FIG. 5: An example of how our 200 explicit solitons cover the protein with PDB code 1KZQ in our data set, in terms of the bond angles $\psi_i$. We use cut-off values 0.3 Å (top), 0.5 Å (middle) and 0.7 Å (bottom). Red dots and lines correspond to sites and structures that are described by the solitons with the cut-off accuracy or better, while black dots and lines correspond to sites where the local distance exceeds the cut-off value; isolated black dots indicate local fluctuations in B-factors. Three or more consecutive black dots indicate the presence of a loop that is not covered by our 200 solitons. Note that at the cut-off value 0.7 Å (bottom) there is only one such loop.
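The reading rule of Figure 5 (isolated sites above the cut-off versus runs of three or more consecutive sites) can be expressed as a short scan over per-site deviations. The sketch below is illustrative: the function name `uncovered_runs`, the input layout and the toy numbers are assumptions.

```python
def uncovered_runs(deviations, cutoff, min_loop=3):
    """Split sites above the cutoff into short runs and uncovered loops.

    Returns two lists of (first_index, last_index) pairs: runs shorter than
    `min_loop` sites, and runs of at least `min_loop` consecutive sites.
    """
    short_runs, loops = [], []
    start = None
    # The sentinel value closes a run that reaches the end of the chain.
    for i, d in enumerate(list(deviations) + [float("-inf")]):
        if d > cutoff:
            if start is None:
                start = i
        elif start is not None:
            target = loops if i - start >= min_loop else short_runs
            target.append((start, i - 1))
            start = None
    return short_runs, loops


# Toy per-site distances in Angstrom, for illustration only.
print(uncovered_runs([0.1, 0.9, 0.2, 0.8, 0.9, 1.0, 0.3, 0.75], cutoff=0.7))
```

Short runs would then be candidates for local B-factor fluctuations, while the longer runs point to loops that are missing from the library.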

As a second example we discuss a loop in the protein with PDB code 3DLK. In [11] we showed how to construct a soliton that describes the super-secondary structure that is located between the coordinate sites 398-416 in the A chain of 3DLK, with RMSD accuracy 1.13 Å. The structure describes a loop that connects an α-helix to a β-strand. We now analyze this loop in terms of our library of 200 solitons. In Figure 6a we display the $(\psi_i, \theta_i)$ profile around the loop region. We recall that according to (3), (4) the $\psi_i$ is determined by the three coordinate sites $i$, $i+1$ and $i+2$, while $\theta_i$ is determined by the four sites with indices from $i-1$ to $i+2$.

There is a relatively large local fluctuation at the coordinate site $i = 404$; according to the PDB data the B-factor of the Cα atom at this site is 40.0 Å², which is clearly above (1). The B-factors at the coordinate sites 403 and 405 are also relatively high, with values 33.5 and 33.5 respectively. But beyond the coordinate site 405, the B-factors are around 25-30, that is, the Debye-Waller fluctuation distances are below 0.7 Å for the sites that we have displayed in Figure 6. In Figure 6c the top (red) line shows the fluctuation distances for the coordinate sites 405-414.
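As a worked check of these numbers, the B-factors quoted above can be converted into fluctuation distances, assuming the Debye-Waller relation in the form $B = 8\pi^2 \langle x^2 \rangle$ (the specific form is an assumption; the rounded values are computed from it):

```latex
\begin{align*}
B = 40.0\ \text{\AA}^2 &: \quad \sqrt{40.0 / (8\pi^2)} \approx 0.71\ \text{\AA} \\
B = 33.5\ \text{\AA}^2 &: \quad \sqrt{33.5 / (8\pi^2)} \approx 0.65\ \text{\AA} \\
B = 30\ \text{\AA}^2   &: \quad \sqrt{30 / (8\pi^2)} \approx 0.62\ \text{\AA} \\
B = 25\ \text{\AA}^2   &: \quad \sqrt{25 / (8\pi^2)} \approx 0.56\ \text{\AA}
\end{align*}
```

Under this assumption only the site with $B = 40.0$ Å² exceeds 0.7 Å, while B-factors of 25-30 Å² correspond to distances of roughly 0.56-0.62 Å.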

In Figure 6b we display the profile of $\psi_i$, after we have implemented the transformation (6). We clearly identify two soliton profiles (11). Due to the relatively large B-factor at coordinate site 404, we try to take the first soliton to start from the bond angle site 405. The definition of this bond angle is independent of site 404, and thus we are optimistic that we do not need to compromise on our ambition to exceed the B-factor accuracy (1) in our loop description. The second soliton ends at bond angle site 412, and the two solitons overlap between the bond angle sites 407 and 409. When we search for similar structures among our 200 solitons, we find two profiles that in combination match the loop. The first soliton covers the coordinate sites 405-410, and the second soliton covers the coordinate sites 409-414. In terms of the bond angles, together they cover the sites 405-412. When we combine these two solitons so that they match each other as accurately as possible at their common coordinate sites 409 and 410, we find a two-soliton configuration that describes the protein loop for residues 405-414 with an RMSD accuracy of 0.31 Å. The (lower) black line in Figure 6c shows the difference between the PDB structure and the two-soliton structure. This difference is clearly less than the Debye-Waller B-factor distance, at every site. The shaded area describes the zero point fluctuation regime around the solitons. We have followed [17] to estimate that the zero point fluctuations have an amplitude that is no larger than 0.15 Å. Finally, in Figure 6d we compare the $\psi_i$ and $\theta_i$ values of the PDB data and the two-soliton configuration. There is essentially no difference.

FIG. 6: Figure a) shows the PDB data for the bond and torsion angles for the monomer A in 3DLK, for sites 398-414. Figure b) displays the $\psi_i$ profile, after we have introduced the gauge transformation (6). The two solitons are clearly visible between sites 405-412. In Figure c) we compare the B-factor of 3DLK (upper line in red) with the distance between its backbone and the two-soliton configuration (lower line in black). The shaded area is the 0.15 Å fluctuation regime around the soliton. In Figure d) we compare the $(\psi_i, \theta_i)$ distributions for the PDB data (black and red) and the two-soliton configuration (blue and green).



V. CONCLUSION 

Protein loops remain a major challenge both in structure classification and prediction. Loops are commonly viewed as apparently random regions with no regular self-similar structure. Here we have shown that loops are not random at all. Their shape is fully determined, with experimental B-factor accuracy, by the dark soliton solution of a generalized discrete nonlinear Schrödinger equation that has only two loop-specific parameters. In particular we have found that the number of different parameter sets, i.e. fundamental loops, appears to be no more than 200, and it is probably even smaller than 57 if we allow for multiple coverings. When the fundamental loops together with the helices and strands are at our disposal, the construction of entire folded proteins becomes like playing with Lego bricks. We can build the entire protein from these modular components by simply putting them together, one after another. Moreover, our quantitative approach is firmly grounded on a physics-based energy function. This should enable energetic analyses of protein folding, and energy comparisons between folds and misfolds. We propose that our soliton approach to protein folding can add a powerful component to the existing classification and modeling schemes.